An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
Abstract
Morphological disambiguation is the process of assigning one set of morphological features to each individual word in a text. When a word is ambiguous (it has several possible analyses), a disambiguation procedure based on the word's context must be applied. This paper deals with morphological disambiguation of Hebrew, which combines morphemes into a word in both agglutinative and fusional ways. We present an unsupervised stochastic model (the only resource we use is a morphological analyzer) that deals with the data sparseness problem caused by the affixational morphology of Hebrew. We present a text encoding method for languages with affixational morphology in which knowledge of word formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapt HMM algorithms for learning and searching this text representation in such a way that segmentation and tagging can be learned in parallel in one step. Results of a large-scale evaluation indicate that this learning improves disambiguation for complex tag sets. Our method is applicable to other languages with affixational morphology.
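To make the idea of searching over segmentation and tagging together more concrete, the following is a minimal sketch of Viterbi-style decoding over a lattice of candidate analyses, assuming a first-order (bigram) HMM over morpheme tags. It is not the paper's algorithm: the function names, the tag labels, the transliterated tokens, and all probabilities are hypothetical, and the unsupervised parameter learning (EM) is omitted, so the model parameters are simply given.

```python
import math

def analysis_logprob(prev_tag, analysis, trans, emit, floor=1e-6):
    """Log-probability of one analysis (a sequence of (morpheme, tag) pairs)
    given the tag that precedes it. Unseen events get a small floor value."""
    total = 0.0
    for morpheme, tag in analysis:
        total += math.log(trans.get((prev_tag, tag), floor))
        total += math.log(emit.get((tag, morpheme), floor))
        prev_tag = tag
    return total

def disambiguate(lattice, trans, emit, start="<s>"):
    """Pick one analysis per word. lattice[i] lists the candidate analyses of
    word i; each analysis is a non-empty tuple of (morpheme, tag) pairs."""
    # best[j] = (score, path) for the best path ending in candidate j
    best = [(analysis_logprob(start, a, trans, emit), [a]) for a in lattice[0]]
    for candidates in lattice[1:]:
        new_best = []
        for a in candidates:
            # With a bigram model, only the last tag of the previous
            # analysis matters when scoring the current one.
            score, path = max(
                ((s + analysis_logprob(p[-1][-1][1], a, trans, emit), p)
                 for s, p in best),
                key=lambda item: item[0],
            )
            new_best.append((score, path + [a]))
        best = new_best
    return max(best, key=lambda item: item[0])[1]

# Hypothetical toy data: the second word is either preposition + noun
# or a single proper noun.
lattice = [
    [(("hw", "PRON"),)],
    [(("b", "PREP"), ("clm", "NOUN")), (("bclm", "PROPN"),)],
]
trans = {("<s>", "PRON"): 0.5, ("PRON", "PREP"): 0.4,
         ("PREP", "NOUN"): 0.6, ("PRON", "PROPN"): 0.1}
emit = {("PRON", "hw"): 0.2, ("PREP", "b"): 0.3,
        ("NOUN", "clm"): 0.01, ("PROPN", "bclm"): 0.001}

result = disambiguate(lattice, trans, emit)
```

With these made-up numbers the segmented reading of the second word scores higher than the proper-noun reading, so choosing an analysis fixes the segmentation and the tags at once. Note that in this sketch an analysis with more morphemes multiplies more factors; no length normalization is attempted.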
Similar Resources
EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start)
We address the task of unsupervised POS tagging. We demonstrate that good results can be obtained using the robust EM-HMM learner when provided with good initial conditions, even with incomplete dictionaries. We present a family of algorithms to compute effective initial estimations p(t|w). We test the method on the task of full morphological disambiguation in Hebrew achieving an error reductio...
Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach
Word-Based or Morpheme-Based? Annotation Strategies for Modern Hebrew Clitics
Morphologically rich languages pose a challenge to the annotators of treebanks with respect to the status of orthographic (spacedelimited) words in the syntactic parse trees. In such languages an orthographic word may carry various, distinct, sorts of information and the question arises whether we should represent such words as a sequence of their constituent morphemes (i.e., a Morpheme-Based a...
Data-Driven Morphological Analysis and Disambiguation for Morphologically Rich Languages and Universal Dependencies
Parsing texts into universal dependencies (UD) in realistic scenarios requires infrastructure for morphological analysis and disambiguation (MA&D) of typologically different languages as a first tier. MA&D is particularly challenging in morphologically rich languages (MRLs), where the ambiguous space-delimited tokens ought to be disambiguated with respect to their constituent morphemes. Here we...